Goto

Collaborating Authors

 directional smoothness


Directional Smoothness and Gradient Methods: Convergence and Adaptivity

Neural Information Processing Systems

We develop new sub-optimality bounds for gradient descent (GD) that depend on the conditioning of the objective along the path of optimization, rather than on global, worst-case constants. Key to our proofs is directional smoothness, a measure of gradient variation that we use to develop upper-bounds on the objective. Minimizing these upper-bounds requires solving implicit equations to obtain a sequence of strongly adapted step-sizes; we show that these equations are straightforward to solve for convex quadratics and lead to new guarantees for two classical step-sizes. For general functions, we prove that the Polyak step-size and normalized GD obtain fast, path-dependent rates despite using no knowledge of the directional smoothness. Experiments on logistic regression show our convergence guarantees are tighter than the classical theory based on $L$-smoothness.




Heavy-Ball Momentum Method in Continuous Time and Discretization Error Analysis

arXiv.org Artificial Intelligence

This paper establishes a continuous time approximation, a piece-wise continuous differential equation, for the discrete Heavy-Ball (HB) momentum method with explicit discretization error. Investigating continuous differential equations has been a promising approach for studying the discrete optimization methods. Despite the crucial role of momentum in gradient-based optimization methods, the gap between the original discrete dynamics and the continuous time approximations due to the discretization error has not been comprehensively bridged yet. In this work, we study the HB momentum method in continuous time while putting more focus on the discretization error to provide additional theoretical tools to this area. In particular, we design a first-order piece-wise continuous differential equation, where we add a number of counter terms to account for the discretization error explicitly. As a result, we provide a continuous time model for the HB momentum method that allows the control of discretization error to arbitrary order of the step size. As an application, we leverage it to find a new implicit regularization of the directional smoothness and investigate the implicit bias of HB for diagonal linear networks, indicating how our results can be used in deep learning. Our theoretical findings are further supported by numerical experiments.




Directional Smoothness and Gradient Methods: Convergence and Adaptivity

arXiv.org Artificial Intelligence

One way to avoid global smoothness of f is to use local Lipschitz continuity of the gradient ("local smoothness"). Local We develop new sub-optimality bounds for gradient smoothness uses different Lipschitz constants for different descent (GD) that depend on the conditioning neighbourhoods, thus avoiding global assumptions and obtaining of the objective along the path of optimization, improved rates. However, such analyses typically require rather than on global, worst-case constants. Key the iterates to be bounded, in which case local smoothness to our proofs is directional smoothness, a measure reduces to L-smoothness over a compact set (Malitsky of gradient variation that we use to develop upperbounds & Mishchenko, 2020). Boundedness can be enforced in a on the objective. Minimizing these upperbounds variety of ways: Zhang & Hong (2020) break optimization requires solving implicit equations to obtain into stages, Patel & Berahas (2022) develop a stopping-time a sequence of strongly adapted step-sizes; framework, and Lu & Mei (2023) use line-search and a modified we show that these equations are straightforward update. These approaches either modify the underlying to solve for convex quadratics and lead to new optimization algorithm, require local smoothness oracles guarantees for two classical step-sizes. For general (Park et al., 2021), or rely on highly complex arguments.


Linear attention is (maybe) all you need (to understand transformer optimization)

arXiv.org Artificial Intelligence

Transformer training is notoriously difficult, requiring a careful design of optimizers and use of various heuristics. We make progress towards understanding the subtleties of training transformers by carefully studying a simple yet canonical linearized shallow transformer model. Specifically, we train linear transformers to solve regression tasks, inspired by J. von Oswald et al. (ICML 2023), and K. Ahn et al. (NeurIPS 2023). Most importantly, we observe that our proposed linearized models can reproduce several prominent aspects of transformer training dynamics. Consequently, the results obtained in this paper suggest that a simple linearized transformer model could actually be a valuable, realistic abstraction for understanding transformer optimization.


Oracle-Efficient Smoothed Online Learning for Piecewise Continuous Decision Making

arXiv.org Artificial Intelligence

The online learning setting has become the most popular regime for studying sequential decision making with dependent and potentially adversarial data. While this paradigm is attractive due to its great generality and minimal set of assumptions [Cesa-Bianchi and Lugosi, 2006], the worstcase nature of the adversary creates statistical and computational challenges [Rakhlin et al., 2015, Littlestone, 1988, Hazan and Koren, 2016]. In order to mitigate these difficulties, Rakhlin et al. [2011] proposed the smoothed setting, wherein the adversary is constrained to sample data from a distribution whose likelihood ratio is bounded above by 1/σ with respect to a fixed dominating measure, which ensures that the adversary cannot choose worst-case inputs with high probability. As in other online learning settings, performance is measured via regret with respect to a best-inhindsight comparator [Cesa-Bianchi and Lugosi, 2006]. Recent works have demonstrated strong computational-statistical tradeoffs in smoothed online learning: while there are statisticaly efficient algorithms that can enjoy regret logarithmic in 1/σ, oracle-efficient algorithms necessarily suffer regret scaling polynomially in 1/σ [Haghtalab et al., 2022a,b, Block et al., 2022], where the learner is assumed access to an Empirical Risk Minimization (ERM) oracle that is able to efficiently optimize functionals on the parameter space. This gap is significant, because in many applications of interest, the natural scaling of σ is exponential in ambient problem dimension [Block and Simchowitz, 2022]. A natural question remains: under which types of smoothing is it possible to design oracleefficient algorithms with regret that scales polynomially in problem dimension? A partial answer was provided by Block and Simchowitz [2022], who demonstrate an efficient algorithm based on the John Ellipsoid which attains log(T/σ) poly(dimension)-regret for noiseless linear classification, and for a suitable generalization to classification with polynomial features.


Smoothed Online Learning for Prediction in Piecewise Affine Systems

arXiv.org Artificial Intelligence

The problem of piecewise affine (PWA) regression and planning is of foundational importance to the study of online learning, control, and robotics, where it provides a theoretically and empirically tractable setting to study systems undergoing sharp changes in the dynamics. Unfortunately, due to the discontinuities that arise when crossing into different ``pieces,'' learning in general sequential settings is impossible and practical algorithms are forced to resort to heuristic approaches. This paper builds on the recently developed smoothed online learning framework and provides the first algorithms for prediction and simulation in PWA systems whose regret is polynomial in all relevant problem parameters under a weak smoothness assumption; moreover, our algorithms are efficient in the number of calls to an optimization oracle. We further apply our results to the problems of one-step prediction and multi-step simulation regret in piecewise affine dynamical systems, where the learner is tasked with simulating trajectories and regret is measured in terms of the Wasserstein distance between simulated and true data. Along the way, we develop several technical tools of more general interest.